ABSTRACT

In this paper, we emphasize the need for data cleansing when
clustering large-scale transaction databases and propose a
new data cleansing method that improves clustering quality
and performance. We evaluate our data cleansing method
through a series of experiments. In these experiments,
clustering quality and performance improved by up to 165%
and 330%, respectively.
USERID=37264:
amusement park, cherry blossom, mall of america, entrance fee, disneyland
USERID=93272:
USERID=20438:
media player, skins, lyric words, download
USERID=72620:
major league, ichiro, baseball cap



Figure 1: An example of a transaction database.



1. INTRODUCTION 

Data mining has been pursued since the 1990's, and clus- 
tering is an important technique in data mining. Clustering 
is finding the groups of objects having similar features, and 
it has been rigorously studied [U HI [6] , since it has a wide 
range of applications. Examples of the applications are
target marketing and recommendation services. The former
finds groups of customers with similar purchasing patterns
and then establishes marketing strategies according to those
patterns. The latter presents products to the customers who
are highly likely to purchase them according to their
purchasing preferences.

Recently, transaction databases have become a new target of
clustering [1, 3]. A transaction is defined as a set of related
items, and a transaction database is a database consisting of
the transactions obtained in an application [11] [12]. As an
example, Figure 1 shows four transactions in a transaction
database from a search engine service. Each transaction
contains the search keywords issued in the same user's
session. Another example of a transaction database is the
product purchase records of a large retailer such as
Wal-Mart. In that database, a transaction is defined as the
set of products purchased by a customer at one time.

Transaction databases have introduced a few technical
challenges. First, the objects handled in previous clustering
algorithms were represented as d-dimensional vectors [2].
That is, they were represented as points in d-dimensional
space and were processed based on the Euclidean distance
between them [3]. However, the transactions in transaction
databases cannot be represented as d-dimensional vectors;
they are called categorical data [3]. Second, transaction
databases are much larger than the datasets handled by
previous algorithms [8]. While those datasets range from
several KBs to several MBs, transaction databases range from
several GBs up to several TBs.

In this paper, we emphasize the need for data cleansing,
which is a pre-processing step before clustering on
transaction databases, and propose a new data cleansing
method that improves clustering performance and quality.
Previous clustering algorithms did not consider a data
cleansing process. In fact, transaction databases, such as
search keyword databases, contain a lot of noise. For example,
there are meaningless search keywords such as
'tjdnfeorhddnjs' that never appear more than once in the
database. This sort of noise increases the number of useless
clusters and degrades clustering performance and quality.
A related idea is used in information retrieval and text
mining. We explain the differences in detail in Section 2.

This paper is organized as follows. In Section 2, we briefly
explain the related work on clustering transaction databases.
In Section 3, we explain the need for data cleansing and
propose a new data cleansing method. In Section 4, we
evaluate our data cleansing method through experiments.
Finally, we conclude this paper in Section 5.



2. RELATED WORK 

Most previous clustering algorithms handled only data
objects that can be represented as d-dimensional vectors.
Only a small number of clustering algorithms handle
categorical data or transaction databases, and the most
representative one is the ROCK algorithm [3]. It was shown
in [3] that clustering categorical data based on the
Euclidean distance gives unsatisfactory results. Therefore,
ROCK adopted the Jaccard coefficient as a similarity measure
between categorical data. However, since ROCK has a time
complexity higher than $O(n^2)$, where n is the number of
objects, it can hardly be applied to large-scale transaction
databases.

Efficient clustering algorithms for large-scale transaction
databases have been proposed in [11] [12]. The notion of a
large item was proposed in [11]. For a pre-specified support
$\theta$ ($0 < \theta < 1$) and a transaction item e, if the
ratio of transactions containing e in a cluster $C_i$ is
larger than $\theta$, the item e is defined as a large item in
the cluster $C_i$; otherwise, it is defined as a small item.
The clustering algorithm in [11], which we call the LARGE
algorithm in this paper, proceeds in the direction of
maximizing the number of large items and simultaneously
minimizing the number of small items by trying to bring the
same transaction items together in a cluster.
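
The large/small item definition can be illustrated with a minimal Python sketch. The function name `split_large_small` and the toy cluster are hypothetical, and the sketch assumes that the ratio is taken over the transactions of one cluster; it is not the LARGE algorithm itself, only the item classification it relies on.

```python
from collections import Counter

def split_large_small(cluster, theta):
    """Classify the items of one cluster for a support theta (0 < theta < 1).

    cluster: list of transactions, each a set of items (hypothetical input format).
    Returns (large, small) as two sets of items.
    """
    # Count, for each item, the number of transactions in the cluster containing it.
    counts = Counter(item for t in cluster for item in set(t))
    n = len(cluster)
    large = {e for e, c in counts.items() if c / n > theta}
    small = set(counts) - large
    return large, small

# Toy cluster of three transactions (illustrative data only).
cluster = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
large, small = split_large_small(cluster, theta=0.6)
print(sorted(large), sorted(small))
```

With this toy cluster, items a and b appear in more than 60% of the transactions and are large; c, d, and e are small.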

The CLOPE algorithm [12], an improvement of LARGE, is
also a heuristic algorithm and maximizes clustering quality
by iteration. The algorithm does not use the notion of
large/small items; instead, it uses a more efficient measure
for computing clustering quality. CLOPE was shown in [12],
through a series of experiments, to have better clustering
performance and quality than ROCK and LARGE.

The problems of LARGE and CLOPE are as follows. The
algorithms did not consider the effect of noise data and
assumed that the number of result clusters k is very small.
However, actual transaction databases contain a lot of noise
data with very low frequencies, and the number of result
clusters is fairly close to the number of transactions n. As
a matter of fact, k is highly variable depending on the
transactions in the database and the items contained in the
transactions. If k is very small compared with n, the
average number of transactions in a cluster is very high, and
such large clusters have little practical usefulness. LARGE
and CLOPE have a time complexity of $O(nk)$, which
approaches $O(n^2)$ as k approaches n.

In a broad sense, a text database or a document database
can be regarded as a form of transaction database; a term
and a document correspond to an item and a transaction,
respectively. However, these databases have a few essential
differences from transaction databases, as follows.

First, since the primary application of text databases is,
given a query term, finding and ranking relevant documents,
the relevance metrics and feature selection methods are
defined between a term and a document [7] [9]. In a
transaction database, however, we use similarity metrics
defined between transactions, since we are interested in the
relationship between transactions. When clustering documents
using relevance metrics such as tf-idf, we must compute the
relevance value for each combination of a term and a document
and then generate a feature vector for each document [9],
which causes severe performance degradation. This
preprocessing cost grows with the size of the text database.
When clustering transactions using inter-transaction
similarity metrics, in contrast, we do not need the
preprocessing step of generating feature vectors, and the
clustering performance is not severely affected by the size
of the transaction database. This advantage of
inter-transaction similarity metrics over term-document
relevance metrics is more significant when dealing with a
frequently updated database. When the database is updated,
all the feature vectors in a text database must be
re-generated, which is unnecessary in a transaction database.

Second, most transaction databases do not allow duplicated
items in a transaction, while the same term can appear any
number of times in a document in text databases. This makes
some relevance metrics useless in transaction databases. For
example, for an item i and a transaction T containing i, the
term frequency is 1/|T|, where |T| is the cardinality of T,
i.e., the number of items in T, and the inverse document
frequency of i is identical for every transaction. Hence, the
tf-idf value between i and T depends only on the cardinality
of T; a smaller transaction T is regarded as more relevant to
i, which is meaningless.
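
The argument can be made explicit with a short derivation. As an assumption, the common textbook definition of the inverse document frequency, $\log(N / n_i)$, is used here; N (the number of transactions) and $n_i$ (the number of transactions containing item i) are symbols introduced only for this illustration.

```latex
\begin{align*}
\mathrm{tf}(i, T) &= \frac{1}{|T|} \quad (i \in T,\ \text{no duplicated items}), \\
\mathrm{idf}(i) &= \log \frac{N}{n_i}, \\
\text{tf-idf}(i, T) &= \frac{1}{|T|} \, \log \frac{N}{n_i}.
\end{align*}
```

For a fixed item i, the factor $\log(N / n_i)$ is the same for every transaction, so ranking transactions by tf-idf is the same as ranking them by $1/|T|$.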

Third, although removing some high frequency and low
frequency terms is effective in text databases, the detailed
procedure is very different from that in transaction
databases. One should be very cautious when removing
unnecessary terms in text databases; terms should not be
removed solely because of their frequencies, and this holds
for both high frequency and low frequency terms. For example,
in a world movie database, the term 'ponyo' should not be
removed only because it appears very rarely, since many
people are interested in the Japanese animation "Ponyo on the
Cliff." In the majority of text databases, removing
unnecessary terms is done under human supervision, which
means that it can hardly be fully automated. The transaction
database, however, has no such issue, and removing
unnecessary items can be fully automated. In this paper, we
propose a new fully automated data cleansing method with
minimal parameter settings and show its effectiveness through
experiments.



3. DATA CLEANSING 

In this section, we explain the need for data cleansing and
propose a new data cleansing method that improves clustering
performance and quality. Our data cleansing method decides
the usefulness of items according to their frequencies in
transactions. Figure 2 shows the item frequencies in two
real-world transaction databases. The horizontal axis
represents item frequencies, and the vertical axis represents
the number of items. As shown in the figure, there are many
items whose frequencies are very small. The two transaction
databases are explained in detail in Section 4.

Figure 2: Item frequencies in two real-world transaction
databases.

Transaction items with too low or too high frequencies have
negative effects on clustering performance and quality. We
explain the phenomenon with examples. We use the same
similarity measure between transactions as ROCK, given in
Eq. (1):

\[
\mathrm{sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \tag{1}
\]

where the denominator represents the number of distinct items
contained in transactions $T_1$ and $T_2$, and the numerator
represents the number of items common to $T_1$ and $T_2$.
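
As a minimal sketch, Eq. (1) can be computed with Python sets. The function name `jaccard`, the convention of returning 0 for two empty transactions, and the two keyword transactions are assumptions made for illustration.

```python
def jaccard(t1, t2):
    """Similarity of Eq. (1): |T1 intersect T2| / |T1 union T2|."""
    a, b = set(t1), set(t2)
    union = a | b
    if not union:
        # Assumed convention: two empty transactions have similarity 0.
        return 0.0
    return len(a & b) / len(union)

# Two keyword transactions in the style of Figure 1 (illustrative data only).
t1 = {"major league", "ichiro", "baseball cap"}
t2 = {"major league", "ichiro", "world series"}
print(jaccard(t1, t2))  # 2 common items out of 4 distinct items -> 0.5
```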

First, we explain the effect of the items with too low
frequencies. Assume that the similarity threshold $\theta$
between transactions is given as $\theta = 0.5$. Consider
three transactions $T_1 = \{abcxyz\}$, $T_2 = \{bcdpqr\}$, and
$T_3 = \{acdstuvw\}$. Then, for every transaction pair $T_i$
and $T_j$ ($i \neq j$, $1 \le i, j \le 3$), it holds that
$\mathrm{sim}(T_i, T_j) < \theta$, and hence the transactions
$T_1$, $T_2$, and $T_3$ do not form a cluster. However, by
removing the items with very low frequencies (i.e.,
xyzpqrstuvw), $T_1$, $T_2$, and $T_3$ become
$T_1' = \{abc\}$, $T_2' = \{bcd\}$, and $T_3' = \{acd\}$,
respectively. Since, for every transaction pair $T_i'$ and
$T_j'$, it holds that $\mathrm{sim}(T_i', T_j') \ge \theta$,
the three transactions $T_1'$, $T_2'$, and $T_3'$ form a
useful cluster. In fact, we can easily find an enormous
number of such transactions as $T_1$, $T_2$, and $T_3$ in
real-world transaction databases. The problem due to low
frequency items cannot be solved by adjusting or lowering
the threshold $\theta$, because the number of low frequency
items is not constant across transactions and hence the
threshold cannot be fixed.
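
The low-frequency example can be checked numerically with a short Python sketch. Treating "very low frequency" as "appears in only one transaction" is an assumption that fits this toy example; the helper `jaccard` is a hypothetical name for the Jaccard similarity of Eq. (1).

```python
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two non-empty item sets."""
    return len(a & b) / len(a | b)

theta = 0.5
transactions = [set("abcxyz"), set("bcdpqr"), set("acdstuvw")]

# Pairwise similarities before cleansing: all below theta.
before = [jaccard(a, b) for a, b in combinations(transactions, 2)]

# Remove items that occur in only one transaction (assumed cut-off for this toy case).
freq = Counter(item for t in transactions for item in t)
cleaned = [{item for item in t if freq[item] > 1} for t in transactions]

# Pairwise similarities after cleansing: all reach theta.
after = [jaccard(a, b) for a, b in combinations(cleaned, 2)]

print(before)
print(after)  # [0.5, 0.5, 0.5]
```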

Second, we show an example where clustering quality is
degraded due to the items with too high frequencies. Consider
four transactions $T_1 = \{abcdxy\}$, $T_2 = \{cdxyzw\}$,
$T_3 = \{qrxyzw\}$, and $T_4 = \{opqrzw\}$. Since, for every
transaction pair $T_i$ and $T_{i+1}$ ($1 \le i < 4$), it holds
that $\mathrm{sim}(T_i, T_{i+1}) \ge \theta$, it is highly
likely that the transactions $T_1$, $T_2$, $T_3$, and $T_4$
form a large useless cluster $C_1 = \{T_1, T_2, T_3, T_4\}$.
However, by removing the items with very high frequencies
(i.e., xyzw), $T_1$, $T_2$, $T_3$, and $T_4$ become
$T_1' = \{abcd\}$, $T_2' = \{cd\}$, $T_3' = \{qr\}$, and
$T_4' = \{opqr\}$, respectively. The transactions $T_1'$,
$T_2'$, $T_3'$, and $T_4'$ naturally form two useful clusters
$C_1' = \{T_1', T_2'\}$ and $C_2' = \{T_3', T_4'\}$. As with
low frequency items, there are an enormous number of
transactions such as $T_1$, $T_2$, $T_3$, and $T_4$ in
real-world transaction databases, and the problem due to high
frequency items cannot be solved by adjusting or raising the
threshold $\theta$.
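
The high-frequency example can be verified in the same way. The cut-off "appears in three or more transactions" is an assumption chosen only to match the items xyzw in this toy example; `jaccard` is a hypothetical helper implementing Eq. (1).

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two non-empty item sets."""
    return len(a & b) / len(a | b)

transactions = [set("abcdxy"), set("cdxyzw"), set("qrxyzw"), set("opqrzw")]

# Similarities of consecutive pairs (T1,T2), (T2,T3), (T3,T4) before cleansing.
before = [jaccard(transactions[i], transactions[i + 1]) for i in range(3)]

# Remove items that occur in three or more transactions (assumed cut-off).
freq = Counter(item for t in transactions for item in t)
cleaned = [{item for item in t if freq[item] < 3} for t in transactions]

# After cleansing, the link between T2' and T3' disappears.
after = [jaccard(cleaned[i], cleaned[i + 1]) for i in range(3)]

print(before)  # [0.5, 0.5, 0.5]
print(after)   # [0.5, 0.0, 0.5]
```

The chain of similarities that glued the four transactions together is broken in the middle, which leaves the two clusters described in the text.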

We assume that the item frequency shown in Figure 2 follows
the lognormal or the exponential distribution [10]. Based on
this assumption, our data cleansing method proceeds as
follows. First, in the transaction database, we count the
number of transaction items for each item frequency (a
positive integer value). Next, using the (item frequency,
count) pairs, we estimate the parameters, namely the mean
$\mu$ and the standard deviation $\sigma$, for the lognormal
or the exponential distribution. Finally, for a pre-specified
parameter s, we remove all the items whose frequencies are
either less than $(\mu - s\sigma)$ or greater than
$(\mu + s\sigma)$. After removing such items, we also remove
empty transactions whose items have been entirely removed. In
most cases, s should be between 3 and 5.
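
The steps above can be sketched in Python for the lognormal case. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `cleanse` is hypothetical, the frequency of an item is taken to be the number of transactions containing it, and the bounds $\mu \pm s\sigma$ are compared against the logarithm of the frequency (the text does not state in which domain the comparison is made).

```python
import math
from collections import Counter

def cleanse(transactions, s=3.0):
    """Frequency-based cleansing, lognormal case (assumptions in the lead-in)."""
    # Item frequency = number of transactions containing the item (assumption).
    freq = Counter(item for t in transactions for item in set(t))
    # One sample per distinct item: an item frequency shared by k items appears k times.
    logs = [math.log(f) for f in freq.values()]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    lo, hi = mu - s * sigma, mu + s * sigma
    # Assumption: the bounds are applied to ln(frequency).
    keep = {item for item, f in freq.items() if lo <= math.log(f) <= hi}
    cleaned = [set(t) & keep for t in transactions]
    # Remove transactions whose items have been entirely removed.
    return [t for t in cleaned if t]

# Toy database: "hub" is in every transaction, "typo" in exactly one,
# and the items m0..m3 and n0..n3 each appear in four transactions.
transactions = [{"hub", f"m{i // 4}", f"n{i % 4}"} for i in range(16)]
transactions[0].add("typo")
transactions.append({"hub"})

cleaned = cleanse(transactions, s=2.0)
print(len(transactions), len(cleaned))
```

The toy database has only ten distinct items, so s = 2 is used here instead of the range suggested in the text; with it, "hub" and "typo" are removed and the transaction that contained only "hub" is dropped.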

In the case of the lognormal distribution, the estimates for
the two parameters $\mu$ and $\sigma$ are obtained using
Eq. (2):

\[
\hat{\mu} = \frac{\sum_{i=1}^{n} \ln x_i}{n}, \qquad
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (\ln x_i - \hat{\mu})^2}{n} \tag{2}
\]

where n is the number of transaction items, and $x_i$
represents item frequency. If there are k items whose
frequencies are $x_i$, then $x_i$ appears k times in Eq. (2).

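In the case of the exponential distribution, we compute the
estimates for the two parameters $\mu$ and $\sigma$ using
Eq. (3):

\[
\hat{\mu} = \frac{1}{\hat{\lambda}}, \qquad
\hat{\sigma}^2 = \frac{1}{\hat{\lambda}^2} \tag{3}
\]

where the estimate $\hat{\lambda}$ is computed as follows:

\[
\hat{\lambda} = \frac{1}{\bar{x}}, \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{4}
\]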
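
The exponential estimates translate directly into code. The following Python sketch uses the hypothetical function name `exponential_bounds` and an illustrative list of item frequencies; it only evaluates Eqs. (3) and (4) and the resulting removal bounds.

```python
def exponential_bounds(frequencies, s):
    """Return (mu - s*sigma, mu + s*sigma) for the exponential case, Eqs. (3)-(4)."""
    n = len(frequencies)
    x_bar = sum(frequencies) / n     # sample mean of the item frequencies
    lam = 1.0 / x_bar                # Eq. (4)
    mu = 1.0 / lam                   # Eq. (3): mean
    sigma = 1.0 / lam                # Eq. (3): sigma^2 = 1 / lambda^2
    return mu - s * sigma, mu + s * sigma

# Illustrative item frequencies, one entry per distinct item.
frequencies = [1, 1, 1, 1, 2, 2, 4, 20]
lower, upper = exponential_bounds(frequencies, s=0.5)
print(lower, upper)  # 2.0 6.0
```

Because the estimated mean and standard deviation are equal in this case, the lower bound is $\bar{x}(1 - s)$, which is not positive for any s of 1 or more; only values of s below 1 can remove low frequency items.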



Choosing between the two distributions for a specific
transaction database depends highly on a human expert's view.
In our experiments, while either distribution contributed to
the improvement of clustering quality and performance, the
lognormal distribution was more effective. Moreover, an
improper value of the parameter s could result in worse
clustering performance and quality. Larger s values were
advantageous for the lognormal distribution, while smaller s
values were advantageous for the exponential distribution.

Our data cleansing method can improve the quality of in- 
complete clustering results. CLOPE cannot always achieve 
complete clustering; actually, in most cases, its clustering 
results are incomplete. In such cases, our method helps im- 
prove clustering quality as well as clustering performance. 

4. EVALUATION 

In this section, we evaluate our data cleansing method through 
a series of experiments. For our evaluation, we implemented 
CLOPE [12] and executed it using real-world transaction 
databases. We compared clustering quality and performance 
between two cases: case (1) using our data cleansing method 
and case (2) without using it. In case (1), the target trans- 
action databases are pre-processed by our data cleansing 
method and then clustered by CLOPE, while, in case (2), 
the databases are directly clustered by CLOPE. 

As explained in Section 2, CLOPE is a heuristic algorithm
that enhances clustering quality by iteration. The algorithm
computes a quality measure called profit for the intermediate
clustering result at every iteration, and it stops when the
profit no longer increases. In our evaluation, we use the
final profit as the clustering quality measure.
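
For orientation, the profit measure can be sketched in Python. The formula is not reproduced in this text; the sketch assumes the definition given in the CLOPE paper [12], in which each cluster contributes its size S(C) (total number of item occurrences) divided by its width W(C) (number of distinct items) raised to the repulsion r, weighted by the number of transactions in the cluster. The function name `profit` and the toy clusters are hypothetical.

```python
def profit(clusters, r):
    """Profit of a clustering, assuming the definition in the CLOPE paper [12].

    clusters: list of clusters, each a list of transactions (sets of items).
    r: repulsion (> 0).
    """
    numerator = 0.0
    total = 0
    for cluster in clusters:
        size = sum(len(t) for t in cluster)      # S(C): total item occurrences
        width = len(set().union(*cluster))       # W(C): number of distinct items
        if width == 0:
            continue                             # skip empty clusters
        numerator += size / width ** r * len(cluster)
        total += len(cluster)
    return numerator / total if total else 0.0

# Toy clustering (illustrative data only).
split = [[{"a", "b"}, {"a", "b", "c"}], [{"d"}]]
merged = [[{"a", "b"}, {"a", "b", "c"}, {"d"}]]
print(profit(split, r=2.0), profit(merged, r=2.0))
```

In this toy case, keeping the unrelated transaction in its own cluster gives a higher profit than merging everything into one cluster.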

CLOPE receives a repulsion r (> 0) as an input parameter.
Repulsion is a real value for controlling intra-cluster
similarity; higher repulsion implies tighter similarity.
Repulsion plays a role analogous to the threshold $\theta$
given to ROCK and LARGE, and by adjusting repulsion, we can
control the number and quality of clusters.

It was justified experimentally in [12] that, by using the 
profit as a metric of clustering quality, CLOPE was more 
effective than the previous algorithms. In the experiment, 
CLOPE was run on the mushroom dataset which contains 
human classification information on poisonous and edible 
mushrooms. CLOPE achieved the accuracy of 100% for the 
repulsion r > 3.1. 

We used two datasets for our evaluation: (a) AOL search 
query database and (b) keyword registration database. The 
AOL database consists of about 20M queries issued by about 
650K users from March 1 through May 31, 2006. The database 
is a list of records, and every record consists of five fields 
AnonID, Query, QueryTime, ItemRank, and ClickURL. The 
first three fields AnonID, Query, and QueryTime represent 
anonymous user ID, search keyword by the user, and times- 
tamp when the query was issued, respectively. The fields 
ItemRank and ClickURL are optional, and they appear when 
the user clicked on any item in query result; they represent 
the rank and URL of the item clicked by the user, respec- 
tively. The keyword registration database is a transaction 
database; each transaction consists of a URL and a list of 
registered keywords. The same keyword can be registered 
by multiple URLs. When a query on a certain keyword is 
issued, the URLs that registered the keyword are shown in 
the query result. 

We transformed the AOL database into a transaction database
in the form shown in Figure 1 for clustering by CLOPE.
Since a record in the AOL database holds a single query, a
user's search queries are spread over multiple records, which
appear adjacently in the AOL database. The queries by the
same user are collected to form one record (transaction) in
the transaction database.

We used the user-id field (AnonID) when transforming the AOL
dataset into a transaction database. A transaction in the
transaction database shown in Figure 1 contains all the query
terms of the same user-id. The query terms of the same
user-id are collected into one transaction, and different
transactions have different user-ids. Hence, the
inter-transaction similarity based on user-id is always zero.
We believe that recommender systems should undergo similar
procedures.
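
The transformation can be sketched as a simple grouping step in Python. The function name `to_transactions` and the sample records are hypothetical; the sketch assumes that each record has already been reduced to an (AnonID, Query) pair and that a whole query string is one transaction item, as in Figure 1.

```python
from collections import defaultdict

def to_transactions(records):
    """Group (anon_id, query) records into one transaction (set of queries) per user."""
    transactions = defaultdict(set)
    for anon_id, query in records:
        transactions[anon_id].add(query)   # a repeated query is stored only once
    return dict(transactions)

# Sample records in the spirit of Figure 1 (illustrative data only).
records = [
    (72620, "major league"),
    (72620, "ichiro"),
    (20438, "media player"),
    (72620, "baseball cap"),
    (72620, "ichiro"),
]
db = to_transactions(records)
print(db[72620])
```

Using a set per user means that a query issued several times by the same user yields a single item, which matches the no-duplicates property of transactions discussed in Section 2.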

The settings for our evaluation are as follows. We used a
PC equipped with an Intel Core2Quad Q9550 2.83GHz CPU,
4GB RAM, and a 600GB HDD, and implemented the programs
using GNU C++ 4.1.2 on CentOS Linux 5.4 64bit Edition
with Kernel 2.6.18. We set the repulsion for CLOPE as
r = 1.5, which is the largest value permitted by our system.
We assumed that the number of transaction items follows the
lognormal distribution and set s = 5.

Figure 3 shows the result of the first experiment using (a)
the AOL database; it compares clustering quality and
performance between cases (1) and (2). In case (2), with 50K
transactions, our program terminated abnormally, most likely
due to a lack of main memory and swap space. As shown in the
figure, clustering quality and performance were improved by
applying our data cleansing method for every number of
transactions. The improvement ratio of quality and
performance reached up to 165% and 330%, respectively. In
case (1), a much smaller number k of clusters was formed by
CLOPE under the same settings. Since CLOPE has $O(nk)$ time
complexity, this smaller k explains the improvement in
clustering performance.

We performed the second experiment using (b) the keyword
database with the same settings as the first experiment, and
the result is shown in Figure 4. As in Figure 3, clustering
quality and performance were also improved by applying our
data cleansing method for every number of transactions. The
improvement ratio of quality and performance reached up to
115% and 166%, respectively.

The third experiment was performed for the two distributions
and a few values of the parameter s. We used (a) the AOL
database from the first experiment, and the number of
transactions was set to 10K. The experiment result is shown
in Figure 5. With the lognormal distribution, clustering
quality and performance converge to a point for s values
larger than or equal to 4.0. This means that increasing s
beyond 4.0 yields no further improvement in clustering
quality and performance from our data cleansing method. With
the exponential distribution, smaller s values were
advantageous for improving clustering quality and
performance.

Figure 4: Comparison of clustering quality and performance
using keyword database.

5. CONCLUSIONS

In this paper, we emphasized the need for data cleansing as
a pre-processing step before clustering large-scale
transaction databases and proposed a new data cleansing
method that improves clustering quality and performance. In
our experimental evaluation of the data cleansing method,
clustering quality and performance were significantly
improved by up to 165% and 330%, respectively. Although our
evaluation was performed with CLOPE, we believe that other
clustering algorithms such as ROCK and LARGE should also
benefit from our data cleansing method.